For more on how to manipulate tidy texts and dataframes: R for Data Science, chapter 5: http://r4ds.had.co.nz/
For an introduction to R programming: The Art of R Programming, chapters 1 and 2: https://ebookcentral.proquest.com/lib/fsu/detail.action?docID=1137514
This chapter covers how to transform your data using the dplyr package (and a new dataset on flights, nycflights13).
At this point, I was having trouble installing the tidyverse package, so I had to update my R version.
To do so, I went to http://mercury.webster.edu/aleshunas/R_learning_infrastructure/Updating%20R%20and%20RStudio.html for update instructions.
R version 3.2.3 (2015-12-10), "Wooden Christmas-Tree": this was my installed version. As you can see, it was released in December 2015. The current version available is R version 3.5.1 ("Feather Spray"), released 2018-07-02.
So even though I tried to install a new R package, it seems that the packages dplyr and tidytext provide the same functions.
library(nycflights13)
library(tidyverse)
── Attaching packages ─────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.6
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(tidytext)
library(dplyr)
(Conflicting function calls) If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag().
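A small sketch of what the masking means in practice, using a made-up five-row tibble rather than the flights data. Once dplyr is loaded, a bare filter() is the dplyr verb; the stats version (a linear filter, used here as a 3-point moving average) is still reachable by its full name.

```r
library(dplyr)

# Toy data, invented for illustration
df <- tibble(x = 1:5)

# dplyr::filter() keeps the rows where the condition is TRUE
kept <- dplyr::filter(df, x > 3)

# stats::filter() is a different tool: it applies a linear filter to a series.
# With three equal weights it computes a centred 3-point moving average,
# so the first and last positions have no full window and come back as NA.
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))

kept
moving_avg
```

Writing the package prefix on both calls is optional for dplyr, but it makes it obvious which function is meant.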
flights
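The five key dplyr verbs are filter(), arrange(), select(), mutate() and summarize().
All of them can be combined with group_by(), which makes them operate group by group.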
filter(flights, month == 1, day == 1)
jan1 <- filter(flights, month == 1, day == 1)
(dec25 <- filter(flights, month == 12, day == 25))
sqrt(2) ^ 2 == 2
[1] FALSE
1 / 49 * 49 == 1
[1] FALSE
near(sqrt(2) ^ 2, 2)
[1] TRUE
near(1 / 49 * 49, 1)
[1] TRUE
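The == comparisons above fail because the results are stored with finite precision, so they differ from the exact answer by a tiny amount. A minimal sketch of the idea behind near(), in base R: treat two numbers as equal when they differ by less than a small tolerance. The helper name approx_equal and its default tolerance are my own choices for illustration.

```r
# Hypothetical helper, written only to illustrate tolerance-based comparison
approx_equal <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

# The rounding error is tiny, but it is not exactly zero
diff_sqrt <- sqrt(2)^2 - 2

exact_match  <- sqrt(2)^2 == 2              # FALSE
approx_match <- approx_equal(sqrt(2)^2, 2)  # TRUE

diff_sqrt
exact_match
approx_match
```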
filter(flights, month == 11 | month == 12)
nov_dec <- filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
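The two pairs of filters above are meant to be equivalent. A quick base R check on a few made-up values (not taken from flights) shows that x %in% c(11, 12) matches the chained ==, and that !(a | b) gives the same result as !a & !b (De Morgan's law), even when a value is missing.

```r
# Invented values for illustration
month     <- c(1, 11, 12, 6)
arr_delay <- c(30, 150, 10, NA)
dep_delay <- c(10, 5, 200, 0)

# Two ways of asking "is the month November or December?"
with_or <- month == 11 | month == 12
with_in <- month %in% c(11, 12)

# Two ways of asking "was neither delay more than 120 minutes?"
not_either <- !(arr_delay > 120 | dep_delay > 120)
both_small <- arr_delay <= 120 & dep_delay <= 120

identical(with_or, with_in)
identical(not_either, both_small)
```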
NA > 5
[1] NA
10 == NA
[1] NA
NA + 10
[1] NA
NA / 2
[1] NA
NA == NA
[1] NA
# Let x be Mary's age. We don't know how old she is.
x <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# Are John and Mary the same age?
x == y
[1] NA
is.na(x)
[1] TRUE
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
Find all flights that:
# Had an arrival delay of two or more hours
arrDelay2 <- filter(flights, arr_delay >= 120)
# Flew to Houston (IAH or HOU)
filter(flights, dest == "IAH" | dest == "HOU")
# Were operated by United, American, or Delta
filter(flights, carrier %in% c("UA", "AA", "DL"))
# Departed in summer (July, August, and September)
filter(flights, month %in% c(7, 8, 9))
# Arrived more than two hours late, but didn’t leave late
filter(flights, arr_delay > 120 & dep_delay <= 0)
# Were delayed by at least an hour, but made up over 30 minutes in flight
filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)
# Departed between midnight and 6am (inclusive)
# midnight is recorded as 2400 in dep_time, not 0, so it needs its own condition
filter(flights, dep_time <= 600 | dep_time == 2400)
Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
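between(x, left, right) takes a lower and an upper bound and returns TRUE for each value of x that lies between them (inclusive). It is a shortcut for x >= left & x <= right.
filter(flights, between(month, 7, 9))
filter(flights, between(dep_time, 0, 600) | dep_time == 2400)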
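To check what between() returns, here is a small sketch on invented departure times in the same HHMM style. It compares between() with the spelled-out comparison; both bounds are included, and a missing value stays missing.

```r
library(dplyr)

# Invented HHMM-style values for illustration
dep_time <- c(1, 517, 600, 601, 2400, NA)

with_between <- between(dep_time, 0, 600)
with_compare <- dep_time >= 0 & dep_time <= 600

with_between
identical(with_between, with_compare)
```

Note that 2400 falls outside the 0 to 600 range, so any value coded that way has to be handled separately.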
How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
# flights with missing departure times
# dep_delay, arr_time, arr_delay and air_time are also missing in these rows,
# so they are most likely flights that were cancelled
nrow(filter(flights, is.na(dep_time)))
[1] 8255
Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
NA^0
[1] 1
NA | TRUE
[1] TRUE
FALSE & NA
[1] FALSE
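One way to state the general rule: the result is not missing when it would be the same no matter what value the NA stood for. A short base R sketch of that idea follows, including the counterexample: NA * 0 stays NA because not every number gives 0 when multiplied by 0 (Inf * 0 is NaN).

```r
# The answer is the same for every possible value, so it is known
na_pow <- NA^0         # anything to the power 0 is 1
na_or  <- NA | TRUE    # anything OR TRUE is TRUE
na_and <- FALSE & NA   # FALSE AND anything is FALSE

# Here the answer depends on the unknown value, so it stays missing
na_or_false <- NA | FALSE
na_and_true <- NA & TRUE

# The tricky case: x * 0 is not 0 for every x
na_times_zero  <- NA * 0
inf_times_zero <- Inf * 0

c(na_pow, na_times_zero, inf_times_zero)
c(na_or, na_and, na_or_false, na_and_true)
```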
arrange(flights, year, month, day)
arrange(flights, desc(dep_delay))
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))
# how to use arrange() to sort all missing values to the start
arrange(flights, !is.na(dep_time), dep_time)
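# i didn't understand at first why this works, but got it now:
# !is.na(dep_time) is FALSE for the missing rows and TRUE for the rest,
# and FALSE sorts before TRUE, so the missing rows come first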
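A quick check of the trick on the three-row tibble used earlier. arrange() normally puts NA last; sorting first on !is.na(x) puts the rows where that is FALSE (the missing ones) at the top, and the second sort key orders the rest.

```r
library(dplyr)

df <- tibble(x = c(5, 2, NA))

# Default behaviour: missing values go to the end
sorted_default <- arrange(df, x)

# !is.na(x) is FALSE for the NA row, and FALSE sorts before TRUE
na_first <- arrange(df, !is.na(x), x)

sorted_default
na_first
```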
# Sort flights to find the most delayed flights.
arrange(flights, desc(dep_delay), desc(arr_delay))
# Find the flights that left earliest.
arrange(flights, dep_delay)
# Sort flights to find the fastest flights (here: shortest time in the air)
arrange(flights, air_time)
# which flights travelled the longest
arrange(flights, desc(distance))
# which flights travelled the shortest
arrange(flights, distance)
# selecting specific columns
select(flights, year, month, day)
# select using : which means all columns in between
select(flights, year:day)
# select all columns except for those specified (-)
select(flights, -(year:day))
helper functions
starts_with("abc"): matches names that begin with "abc".
ends_with("xyz"): matches names that end with "xyz".
contains("ijk"): matches names that contain "ijk".
matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You'll learn more about regular expressions in strings.
num_range("x", 1:3): matches x1, x2 and x3.
select() can rename variables, but it's finicky at best because it drops every variable not explicitly mentioned. Just use rename() instead, which keeps all the other variables.
rename(flights, tail_num = tailnum)
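To see the difference between the two, here is a small sketch on a made-up three-column tibble (the name planes_toy and its values are invented for illustration). Renaming inside select() keeps only the column that was named, while rename() keeps everything.

```r
library(dplyr)

# Invented data for illustration
planes_toy <- tibble(
  tailnum  = c("N14228", "N24211"),
  carrier  = c("UA", "UA"),
  distance = c(1400, 1416)
)

via_select <- select(planes_toy, tail_num = tailnum)
via_rename <- rename(planes_toy, tail_num = tailnum)

names(via_select)  # only the renamed column survives
names(via_rename)  # all three columns, one of them renamed
```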
# select dep_time, dep_delay, arr_time, and arr_delay from flights.
# select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, dep_time:arr_delay, -sched_dep_time, -sched_arr_time)
# naming the same variable multiple times: select() keeps it only once
select(flights, year, year)
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
# what you might think it does: match TIME case sensitively, so no column would be kept (none has upper-case TIME in its name)
# what it actually does: it ignores case by default, so it keeps every column with "time" anywhere in its name
select(flights, contains("TIME"))
# to make the helper case sensitive, set ignore.case = FALSE
select(flights, contains("TIME", ignore.case = FALSE))
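The case handling is easier to see on a tiny tibble with mixed-case names. This is only a sketch: the tibble and the TIME_ZONE column are invented for illustration. By default contains() ignores case, and passing ignore.case = FALSE makes the match case sensitive.

```r
library(dplyr)

# Invented one-row tibble; TIME_ZONE is a hypothetical column
toy <- tibble(dep_time = 517, air_time = 227, TIME_ZONE = "EST", distance = 1400)

# Default: case is ignored, so every name containing "time" matches
default_match <- names(select(toy, contains("TIME")))

# Case sensitive: only names containing upper-case "TIME" match
case_match <- names(select(toy, contains("TIME", ignore.case = FALSE)))

default_match
case_match
```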
mutate() always adds new columns at the end of your dataset. The easiest way to see all the columns is View().
flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
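A minimal check of where the new columns end up, using a two-row tibble with invented values but the same column names as flights_sml. The names of the result show gain and speed appended after the existing columns.

```r
library(dplyr)

# Invented values for illustration
toy <- tibble(
  dep_delay = c(2, 4),
  arr_delay = c(11, 20),
  distance  = c(1400, 1416),
  air_time  = c(227, 227)
)

result <- mutate(toy,
  gain  = dep_delay - arr_delay,    # minutes made up in the air
  speed = distance / air_time * 60  # miles per hour
)

names(result)
result$gain
```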
sentiments
Examples from this chapter were already done in the previous homework assignment